Goto

Collaborating Authors

 social regret




Learning from Delayed Feedback in Games via Extra Prediction

Fujimoto, Yuma, Abe, Kenshi, Ariu, Kaito

arXiv.org Artificial Intelligence

This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study firstly proves that even a single-step delay worsens the performance of OFTRL from the aspects of social regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture an intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers the good performances that social regret is constant in general-sum normal-form games, and the strategies last-iterate converge to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.


Tight Regret Upper and Lower Bounds for Optimistic Hedge in Two-Player Zero-Sum Games

Tsuchiya, Taira

arXiv.org Machine Learning

In two-player zero-sum games, the learning dynamic based on optimistic Hedge achieves one of the best-known regret upper bounds among strongly-uncoupled learning dynamics. With an appropriately chosen learning rate, the social and individual regrets can be bounded by $O(\log(mn))$ in terms of the numbers of actions $m$ and $n$ of the two players. This study investigates the optimality of the dependence on $m$ and $n$ in the regret of optimistic Hedge. To this end, we begin by refining existing regret analysis and show that, in the strongly-uncoupled setting where the opponent's number of actions is known, both the social and individual regret bounds can be improved to $O(\sqrt{\log m \log n})$. In this analysis, we express the regret upper bound as an optimization problem with respect to the learning rates and the coefficients of certain negative terms, enabling refined analysis of the leading constants. We then show that the existing social regret bound as well as these new social and individual regret upper bounds cannot be further improved for optimistic Hedge by providing algorithm-dependent individual regret lower bounds. Importantly, these social regret upper and lower bounds match exactly including the constant factor in the leading term. Finally, building on these results, we improve the last-iterate convergence rate and the dynamic regret of a learning dynamic based on optimistic Hedge, and complement these bounds with algorithm-dependent dynamic regret lower bounds that match the improved bounds.




Double Auctions with Two-sided Bandit Feedback

Basu, Soumya, Sankararaman, Abishek

arXiv.org Artificial Intelligence

Double Auction enables decentralized transfer of goods between multiple buyers and sellers, thus underpinning functioning of many online marketplaces. Buyers and sellers compete in these markets through bidding, but do not often know their own valuation a-priori. As the allocation and pricing happens through bids, the profitability of participants, hence sustainability of such markets, depends crucially on learning respective valuations through repeated interactions. We initiate the study of Double Auction markets under bandit feedback on both buyers' and sellers' side. We show with confidence bound based bidding, and `Average Pricing' there is an efficient price discovery among the participants. In particular, the regret on combined valuation of the buyers and the sellers -- a.k.a. the social regret -- is $O(\log(T)/\Delta)$ in $T$ rounds, where $\Delta$ is the minimum price gap. Moreover, the buyers and sellers exchanging goods attain $O(\sqrt{T})$ regret, individually. The buyers and sellers who do not benefit from exchange in turn only experience $O(\log{T}/ \Delta)$ regret individually in $T$ rounds. We augment our upper bound by showing that $\omega(\sqrt{T})$ individual regret, and $\omega(\log{T})$ social regret is unattainable in certain Double Auction markets. Our paper is the first to provide decentralized learning algorithms in a two-sided market where \emph{both sides have uncertain preference} that need to be learned.